Let's say we have two features; let one be \(x\) and the other \(y\). Recall that in linear regression we are looking for a model of the form:
\[y_i=\beta_0+\beta_1 x_i+\varepsilon_i\]
After the fitting, for each data point we would have: \[y_i=\hat{\beta}_0+\hat{\beta}_1 x_i+r_i\] where \(r_i\) is the residual. It can be rewritten as:
\[\hat{\beta}_0+r_i=y_i-\hat{\beta}_1 x_i\;\;\;\;\;(1)\]
The first principal component \(z_1\), calculated on \((x,y)\), has scores \[z_{i1}=\phi_{11}y_i+\phi_{21}x_i\] where \((\phi_{11},\phi_{21})\) is the loading vector. Dividing by \(\phi_{11}\): \[\frac{z_{i1}}{\phi_{11}}=y_i+\frac{\phi_{21}}{\phi_{11}}x_i\;\;\;\;\;(2)\]
There is a functional resemblance between the last two equations (both describe a linear relationship between \(y\) and \(x\)). Is the following true: \[\hat{\beta}_0+r_i=\frac{z_{i1}}{\phi_{11}}\] \[\frac{\phi_{21}}{\phi_{11}}=-\hat{\beta}_1\] Answer: Yes, the equations share the same linear form, although the fitted quantities generally differ numerically because the two procedures optimize different criteria.
What is the difference between coefficient optimization in linear regression and the calculation of the first principal component?
Answer: For the first principal component, the sum of squared distances is minimized orthogonally to the fitted line (equivalently, the variance along the component's direction is maximized), while in linear regression the squared errors are minimized in the \(y\) direction only.
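This difference can be illustrated with a small self-contained sketch on simulated data (not the survey data): fit OLS and the first PC to the same points and compare the implied slopes.

```r
set.seed(1)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200, sd = 0.8)

# OLS slope: minimizes squared errors in the y direction only
ols_slope <- coef(lm(y ~ x))[["x"]]

# slope implied by the first principal component of (x, y)
pc <- prcomp(cbind(x, y))          # centers by default; no scaling
phi <- pc$rotation[, 1]            # first loading vector
pca_slope <- phi[["y"]] / phi[["x"]]

c(ols = ols_slope, pca = pca_slope)  # the two slopes differ
```

Neither slope is "wrong"; they answer different optimization problems, so the identities above should not be expected to hold numerically.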
In this exercise we will study the UK Smoking Data (smoking.R, smoking.rda or smoking.csv):
Description
Survey data on smoking habits from the UK. The data set can be used for analyzing the demographic characteristics of smokers and types of tobacco consumed.
Format
A data frame with 1691 observations on the following 12 variables.
gender - Gender, with levels Female and Male.
age - Age.
marital_status - Marital status, with levels Divorced, Married, Separated, Single and Widowed.
highest_qualification - Highest education level, with levels A Levels, Degree, GCSE/CSE, GCSE/O Level, Higher/Sub Degree, No Qualification, ONC/BTEC and Other/Sub Degree.
nationality - Nationality, with levels British, English, Irish, Scottish, Welsh, Other, Refused and Unknown.
ethnicity - Ethnicity, with levels Asian, Black, Chinese, Mixed, White, Refused and Unknown.
gross_income - Gross income, with levels Under 2,600, 2,600 to 5,200, 5,200 to 10,400, 10,400 to 15,600, 15,600 to 20,800, 20,800 to 28,600, 28,600 to 36,400, Above 36,400, Refused and Unknown.
region - Region, with levels London, Midlands & East Anglia, Scotland, South East, South West, The North and Wales.
smoke - Smoking status, with levels No and Yes.
amt_weekends - Number of cigarettes smoked per day on weekends.
amt_weekdays - Number of cigarettes smoked per day on weekdays.
type - Type of cigarettes smoked, with levels Packets, Hand-Rolled, Both/Mainly Packets and Both/Mainly Hand-Rolled.
Source: National STEM Centre, Large Datasets from stats4schools, https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools.
Obtained from https://www.openintro.org/data/index.php?data=smoking
2.1 Read the data from smoking.R or smoking.rda. Hint: take a look at the source or load functions; there is also a smoking.csv file for reference.
# load libraries
library(readr)
library(dplyr)
library(tidyr)
library(data.table)
library(plyr) # provides revalue(), used below
library(plotly)
library(lubridate)
library(ggbiplot)
library(caret)
# Load data
load("smoking.rda")
Take a look at the data:
# place holder
head(smoking)
summary(smoking)
## gender age marital_status highest_qualification
## Female:965 Min. :16.00 Divorced :161 No Qualification :586
## Male :726 1st Qu.:34.00 Married :812 GCSE/O Level :308
## Median :48.00 Separated: 68 Degree :262
## Mean :49.84 Single :427 Other/Sub Degree :127
## 3rd Qu.:65.50 Widowed :223 Higher/Sub Degree:125
## Max. :97.00 A Levels :105
## (Other) :178
## nationality ethnicity gross_income
## English :833 Asian : 41 5,200 to 10,400 :396
## British :538 Black : 34 10,400 to 15,600:268
## Scottish:142 Chinese: 27 2,600 to 5,200 :257
## Other : 71 Mixed : 14 15,600 to 20,800:188
## Welsh : 66 Refused: 13 20,800 to 28,600:155
## Irish : 23 Unknown: 2 Under 2,600 :133
## (Other) : 18 White :1560 (Other) :294
## region smoke amt_weekends amt_weekdays
## London :182 No :1270 Min. : 0.00 Min. : 0.00
## Midlands & East Anglia:443 Yes: 421 1st Qu.:10.00 1st Qu.: 7.00
## Scotland :148 Median :15.00 Median :12.00
## South East :252 Mean :16.41 Mean :13.75
## South West :157 3rd Qu.:20.00 3rd Qu.:20.00
## The North :426 Max. :60.00 Max. :55.00
## Wales : 83 NA's :1270 NA's :1270
## type
## :1270
## Both/Mainly Hand-Rolled: 10
## Both/Mainly Packets : 42
## Hand-Rolled : 72
## Packets : 297
##
##
There are many fields, so for this exercise let's concentrate only on smoke, gender, age, marital_status, highest_qualification and gross_income.
Create a new data.frame with only these columns.
# place holder
df1<-smoking
df1<-select(df1,smoke, gender, age, marital_status, highest_qualification,gross_income)
df1
2.2 Omit all incomplete records.
# place holder
df1 <- df1 %>% drop_na()
str(df1)
## tibble [1,691 × 6] (S3: tbl_df/tbl/data.frame)
## $ smoke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
## $ age : int [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
## $ highest_qualification: Factor w/ 8 levels "A Levels","Degree",..: 6 6 2 2 4 4 2 2 3 6 ...
## $ gross_income : Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
df1=df1[!grepl("Unknown|Refused", df1$gross_income),]
df9<-df1
df1
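One R gotcha worth noting here (a self-contained toy example, not the survey data): subsetting rows, as done above for the Unknown/Refused incomes, does not remove the now-empty factor levels; droplevels() does.

```r
# subsetting removes the rows but keeps the now-empty factor level:
f <- factor(c("a", "b", "unknown", "a"))
f2 <- f[f != "unknown"]
levels(f2)               # still includes "unknown"
f2 <- droplevels(f2)     # discard unused levels
levels(f2)               # "a" "b"
```

The same applies to df1$gross_income: after the grepl() filter its factor still carries the Refused and Unknown levels unless droplevels() is called.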
2.3 For PCA, features should be numeric. Some fields are binary (gender and smoke) and can easily be converted to numeric type (one and zero). Other fields, like marital_status, have more than two categories; convert them to binary indicators (i.e. is_married, is_divorced). Several features in the data set are ordinal (gross_income and highest_qualification); convert them to some kind of sensible numeric level (note that the levels in these factors are not in a meaningful order). (5 points)
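The parenthetical warning, that factor levels are not in a meaningful order, is the crux: converting a factor straight to numeric yields alphabetical level codes. A toy sketch (the labels are illustrative, not the survey's):

```r
sizes <- factor(c("small", "medium", "large"))
# levels sort alphabetically ("large", "medium", "small"), so the codes
# do NOT follow the small < medium < large ordering:
as.numeric(sizes)                    # 3 2 1

# fix: map the labels to explicit numeric levels, then convert
lev <- c(small = 0, medium = 1, large = 2)
lev[as.character(sizes)]             # 0 1 2
```

This is exactly why gross_income and highest_qualification need an explicit mapping below rather than a bare as.numeric().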
# place holder
df2 <- data.frame(
  # convert binary fields from boolean to numeric (one/zero)
  smoker = as.numeric(df1$smoke == "Yes"),
  male = as.numeric(df1$gender == "Male"),
  age = as.numeric(df1$age),
  # expand marital_status into binary indicator columns:
  divorced = as.numeric(df1$marital_status == "Divorced"),
  married = as.numeric(df1$marital_status == "Married"),
  separated = as.numeric(df1$marital_status == "Separated"),
  single = as.numeric(df1$marital_status == "Single"),
  widowed = as.numeric(df1$marital_status == "Widowed"))
df2$education <- revalue(df1$highest_qualification, c(
"No Qualification"=0,
"GCSE/CSE"=1,
"GCSE/O Level"=1,
"ONC/BTEC"=1,
"A Levels"=1,
"Other/Sub Degree"=1,
"Higher/Sub Degree"=1,
"Degree"=2
))
df2$gross_income <- revalue(df1$gross_income,c(
"Under 2,600"=0,
"2,600 to 5,200"=1,
"5,200 to 10,400"=2,
"10,400 to 15,600"=3,
"15,600 to 20,800"=4,
"20,800 to 28,600"=5,
"28,600 to 36,400"=6,
"Above 36,400"=7))
# convert remaining columns to numeric; go through as.character(), because
# as.numeric() on a factor returns internal level codes, which would
# scramble the ordinal mappings above
for (col in colnames(df2)) {
  df2[col] <- as.numeric(as.character(df2[[col]]))
}
head(df2)
apply(df2, 2, var)
## smoker male age divorced married separated
## 0.18944461 0.24569834 340.21001038 0.08774748 0.24995302 0.04041983
## single widowed education gross_income
## 0.18593024 0.11201066 0.81168300 5.83401861
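The variances above span several orders of magnitude (age ≈ 340 vs ≈ 0.2 for the binary indicators), which is why scale = TRUE matters in the next step. A small synthetic sketch (toy data, not the survey):

```r
set.seed(2)
d <- data.frame(a = rnorm(100, sd = 20), b = rnorm(100, sd = 1))
unscaled <- prcomp(d)                 # PC1 is almost entirely 'a'
scaled   <- prcomp(d, scale = TRUE)   # both features contribute equally
abs(unscaled$rotation["a", "PC1"])    # ~1
abs(scaled$rotation["a", "PC1"])      # 1/sqrt(2) for two standardized variables
```

Without scaling, the high-variance column hijacks the first component; with scaling, every feature enters the PCA with unit variance.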
2.4 Do PCA on all columns except smoking status. (5 points)
# place holder
# note: the five marital-status indicators sum to one for every person, so
# they are linearly dependent and the last PC has essentially zero variance
pr.out <- prcomp(df2[-1], scale = TRUE)
pr.out
## Standard deviations (1, .., p=9):
## [1] 1.448106e+00 1.277947e+00 1.085353e+00 1.032554e+00 1.007889e+00
## [6] 9.522279e-01 8.542440e-01 6.110390e-01 2.298863e-15
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## male -0.069045898 0.2219838 -0.25632568 -0.306080000 -0.60565274
## age 0.592691108 -0.0571466 -0.01327458 -0.072309268 -0.13601513
## divorced 0.009009441 -0.2277011 0.73052957 -0.419257856 0.13197378
## married 0.136937598 0.7514015 0.00608842 0.008375545 0.04035204
## separated -0.034922482 -0.1198604 0.25780441 0.829196554 -0.29578157
## single -0.484572029 -0.3450197 -0.42461118 -0.104711862 0.06865573
## widowed 0.432758406 -0.4044061 -0.26348416 -0.004631138 -0.08786250
## education 0.421621057 -0.1522312 -0.14456439 -0.046717500 -0.03787620
## gross_income 0.155058229 0.1043154 -0.25165427 0.156536362 0.70305370
## PC6 PC7 PC8 PC9
## male -0.56031284 -0.31537829 -0.07961684 -2.075081e-16
## age 0.01800593 -0.05823202 0.78602250 -2.377316e-16
## divorced -0.29039180 -0.08004172 -0.02602050 3.602669e-01
## married 0.15502032 0.14146485 -0.03384155 6.080456e-01
## separated -0.28151060 -0.03568364 0.05087594 2.445143e-01
## single -0.07024729 0.21385593 0.35283087 5.244233e-01
## widowed 0.28506267 -0.39457239 -0.41155972 4.070396e-01
## education -0.36327024 0.75176413 -0.27826386 -1.390551e-16
## gross_income -0.52911030 -0.32075259 0.01083054 3.939220e-17
2.5 Make a scree plot (5 points)
# place holder
pr.var <- pr.out$sdev^2
pr.var
## [1] 2.097011e+00 1.633150e+00 1.177992e+00 1.066167e+00 1.015841e+00
## [6] 9.067380e-01 7.297328e-01 3.733687e-01 5.284773e-30
pve <- 100 * pr.var / sum(pr.var)
pve
## [1] 2.330012e+01 1.814611e+01 1.308880e+01 1.184630e+01 1.128712e+01
## [6] 1.007487e+01 8.108142e+00 4.148541e+00 5.871970e-29
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,30),type='b')
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,100),type='b')
Comment on the shape; if you need to reduce dimensions, how many would you choose?
The scree plot has a single elbow, after which the eigenvalues level off. The last component has essentially zero variance because the five marital-status indicators always sum to one, so they are linearly dependent.
We need just two principal components, as together they describe more than 40% of the variance in the data set.
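A common heuristic for choosing the number of components is to keep the smallest k whose cumulative PVE crosses a target threshold. A sketch using the pve values printed above (k_for is a hypothetical helper, not part of the assignment):

```r
# smallest k whose cumulative explained variance reaches the target (in %)
k_for <- function(pve, target) which(cumsum(pve) >= target)[1]

# pve values as printed above, in percent
pve_above <- c(23.30, 18.15, 13.09, 11.85, 11.29, 10.07, 8.11, 4.15, 0)
k_for(pve_above, 40)   # 2 components already pass 40%
k_for(pve_above, 80)   # 6 components needed for 80%
```

So "two components" corresponds to a fairly low 40% target; a stricter 80% target would keep six.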
2.6 Make a biplot, color points by the smoke field. (5 points)
# place holder
ggplotly(ggbiplot(pr.out, scale = 0, groups=factor(df1$smoke)))
Comment on the observed biplot.
PC1 explains 23.3% of the variability in the data set and PC2 explains 18.1%, so together the first two principal components explain about 41% of the variability.
In the biplot, data points are shown as dots and the arrows carry the loading information.
Red dots are people who smoke; blue dots are people who don't.
The length of a vector indicates how well the corresponding variable is represented in this projection, and is proportional to its variability along the first two components.
The first two PCs represent the marital_status indicators well, but capture less of the remaining features in the data set.
The angle between two vectors reflects correlation: single and married point in nearly opposite directions (strong negative correlation), while age and education show the strongest positive correlation.
Can we use the first two PCs to discriminate smoking?
Yes, we can use the first two PCs to discriminate smoking.
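One informal way to back this claim up: fit a logistic regression on the first two PC scores and compare in-sample accuracy with the majority-class rate. pc_screen below is a hypothetical helper (not part of the assignment); applied here it would be called as pc_screen(df2[-1], df2$smoker).

```r
# fit a logistic regression of a 0/1 label on the first two PC scores and
# report in-sample accuracy next to the majority-class baseline
pc_screen <- function(X, label) {
  pr <- prcomp(X, scale = TRUE)
  scores <- as.data.frame(pr$x[, 1:2])
  fit <- glm(label ~ PC1 + PC2, data = scores, family = binomial)
  acc <- mean((predict(fit, type = "response") > 0.5) == label)
  c(accuracy = acc, majority_rate = max(mean(label), 1 - mean(label)))
}
```

If the accuracy barely beats the majority rate, the first two PCs carry little information about the label.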
2.7 Based on the loading vectors, can we give the PCs descriptive names? (5 points)
PC1 = PC_age_vs_single (its largest loadings are age, widowed and education on one side, single on the other)
PC2 = PC_married_male (dominated by the married indicator, with a smaller gender/male contribution)
2.8 Maybe some of the splits between categories or mappings to numerics should be revisited; if so, what would you do differently? (5 points)
Looking at the biplot, the major contribution comes from marital_status, and only the married/not-married distinction really matters: two categories would be enough, whether a person is married versus everyone else. So if I redo the splits, I will combine every category other than married into one.
2.9 Follow your suggestion in 2.8 and redo PCA and biplot (5 points)
df1<-smoking
df1<-select(df1,smoke, gender, age, marital_status, highest_qualification,gross_income)
df1 <- df1 %>% drop_na()
df1=df1[!grepl("Unknown|Refused", df1$gross_income),]
df9<-df1
df3 <- data.frame(
  # convert binary fields from boolean to numeric (one/zero)
  smoker = as.numeric(df9$smoke == "Yes"),
  male = as.numeric(df9$gender == "Male"),
  age = as.numeric(df9$age),
  # single binary indicator: married vs everyone else
  married = as.numeric(df9$marital_status == "Married"))
df3$education <- revalue(df9$highest_qualification, c(
"No Qualification"=0,
"GCSE/CSE"=1,
"GCSE/O Level"=1,
"ONC/BTEC"=1,
"A Levels"=1,
"Other/Sub Degree"=1,
"Higher/Sub Degree"=1,
"Degree"=2
))
df3$gross_income <- revalue(df9$gross_income,c(
"Under 2,600"=0,
"2,600 to 5,200"=1,
"5,200 to 10,400"=2,
"10,400 to 15,600"=3,
"15,600 to 20,800"=4,
"20,800 to 28,600"=5,
"28,600 to 36,400"=6,
"Above 36,400"=7))
# convert remaining columns to numeric (again via as.character, so the
# ordinal mappings above are preserved rather than replaced by level codes);
# note this operates on df3, the re-encoded frame built above
for (col in colnames(df3)) {
  df3[col] <- as.numeric(as.character(df3[[col]]))
}
head(df3)
apply(df3, 2, var)
# PCA on all columns except smoking status
pr.out <- prcomp(df3[-1], scale = TRUE)
pr.out
# scree plot
pr.var2 <- pr.out$sdev^2
pve2 <- 100 * pr.var2 / sum(pr.var2)
pve2
plot(pve2, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,30), type='b')
plot(cumsum(pve2), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,100), type='b')
# biplot, colored by smoking status
ggplotly(ggbiplot(pr.out, scale = 0, groups = factor(df3$smoker)))
Get a data set from your final project (or find something suitable). The data set should have at least four variables and shouldn't be one used in the class PCA examples (iris, mpg, diamonds and so on).
sales <- read.csv(file = "car_data.csv")
head(sales)
summary(sales)
## User.ID Gender Age AnnualSalary
## Min. : 1.0 Length:1000 Min. :18.00 Min. : 15000
## 1st Qu.: 250.8 Class :character 1st Qu.:32.00 1st Qu.: 46375
## Median : 500.5 Mode :character Median :40.00 Median : 72000
## Mean : 500.5 Mean :40.11 Mean : 72689
## 3rd Qu.: 750.2 3rd Qu.:48.00 3rd Qu.: 90000
## Max. :1000.0 Max. :63.00 Max. :152500
## Purchased
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.402
## 3rd Qu.:1.000
## Max. :1.000
colnames(sales)
## [1] "User.ID" "Gender" "Age" "AnnualSalary" "Purchased"
df1<-sales
df1 <- select(df1, Purchased, User.ID, Gender, Age, AnnualSalary)
df1
df1 <- df1 %>% drop_na()
str(df1)
## 'data.frame': 1000 obs. of 5 variables:
## $ Purchased : int 0 0 0 1 0 1 1 0 0 0 ...
## $ User.ID : int 385 681 353 895 661 846 219 588 85 465 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age : int 35 40 49 40 25 47 46 42 30 41 ...
## $ AnnualSalary: int 20000 43500 74000 107500 79000 33500 132500 64000 84500 52000 ...
df2 <- data.frame(
  # Purchased is already coded 0/1 (see str() above), so compare with 1, not "Yes"
  purchased = as.numeric(df1$Purchased == 1),
  male = as.numeric(df1$Gender == "Male"),
  age = df1$Age,
  salary = df1$AnnualSalary)
#convert to numeric
for(col in colnames(df2)){
df2[col] <- as.numeric(df2[[col]])
}
head(df2)
apply(df2, 2, var)
## Purshase male age salary
## 0.000000e+00 2.499940e-01 1.146414e+02 1.189446e+09
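The zero variance reported for the purchase column above is a red flag: comparing the 0/1-coded Purchased field against "Yes" never matches, so every value ends up FALSE. A small hypothetical helper (not part of the assignment) to catch such columns before running PCA:

```r
# report per-column variances and warn about any zero-variance column,
# which usually signals an encoding bug and also breaks prcomp(scale = TRUE)
check_variance <- function(d) {
  v <- apply(d, 2, var)
  if (any(v == 0))
    warning("zero-variance columns: ", paste(names(v)[v == 0], collapse = ", "))
  v
}
```

Calling check_variance(df2) before prcomp() would have flagged the bad column immediately.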
pr.out <- prcomp(df2[-1],scale = TRUE)
pr.out
## Standard deviations (1, .., p=3):
## [1] 1.1030650 0.9751905 0.9122781
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## male -0.4358527 0.8930603 0.1116949
## age 0.6485078 0.2255723 0.7270178
## salary 0.6240754 0.3893077 -0.6774728
pr.var <- pr.out$sdev^2
pr.var
## [1] 1.2167523 0.9509964 0.8322513
pve <- 100 * pr.var / sum(pr.var)
pve
## [1] 40.55841 31.69988 27.74171
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,50),type='b')
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,100),type='b')
ggplotly(ggbiplot(pr.out, scale = 0, groups = factor(df1$Purchased)))
The data set is from Kaggle and is used to predict whether a person will purchase a car.
From the scree plot we can tell that PC1 explains about 40.6% of the variance and PC2 about 31.7%, so the first two PCs together explain more than 70% of the variance in the data set.
In the biplot, green dots mark people who purchased a car and red dots those who did not. The gender vector is nearly orthogonal to the other two loading vectors, with loadings of about -0.44 on PC1 and 0.89 on PC2, so gender contributes mostly to PC2 while age and salary dominate PC1.